{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word-level language detection using rules"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/language-detection-words](https://github.com/huseinzol05/Malaya/tree/master/example/language-detection-words).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.87 s, sys: 3.85 s, total: 6.73 s\n",
      "Wall time: 1.97 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import malaya"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install pyenchant\n",
    "\n",
    "pyenchant is an optional dependency; full installation steps are at https://pyenchant.github.io/pyenchant/install.html"
   ]
  },
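   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "A quick way to check whether the optional dependency is present before relying on it is sketched below. It only assumes that pyenchant is imported under the module name `enchant`; the helper name `has_pyenchant` is hypothetical and not part of Malaya.\n",
     "\n",
     "```python\n",
     "import importlib.util\n",
     "\n",
     "\n",
     "def has_pyenchant() -> bool:\n",
     "    # find_spec returns None when the top-level module cannot be located.\n",
     "    return importlib.util.find_spec('enchant') is not None\n",
     "\n",
     "\n",
     "print('pyenchant available:', has_pyenchant())\n",
     "```"
    ]
   },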
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load model\n",
    "\n",
    "```python\n",
    "def substring_rules(model, **kwargs):\n",
    "    \"\"\"\n",
    "    Detect EN, MS and OTHER languages in a string.\n",
    "\n",
    "    EN word detection uses `pyenchant` from https://pyenchant.github.io/pyenchant/ and\n",
    "    the user-supplied language detection model.\n",
    "\n",
    "    MS word detection uses `malaya.dictionary.is_malay` and\n",
    "    the user-supplied language detection model.\n",
    "\n",
    "    OTHER word detection uses any language detection classification model, such as\n",
    "    `malaya.language_detection.fasttext`.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model : Callable\n",
    "        Callable model, must have `predict` method.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result : malaya.model.rules.LanguageDict class\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.\n"
     ]
    }
   ],
   "source": [
    "fasttext = malaya.language_detection.fasttext()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = malaya.language_detection.substring_rules(model = fasttext)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predict\n",
    "\n",
    "```python\n",
    "def predict(\n",
    "    self,\n",
    "    words: List[str],\n",
    "    acceptable_ms_label: List[str] = ['malay', 'ind'],\n",
    "    acceptable_en_label: List[str] = ['eng', 'manglish'],\n",
    "    use_is_malay: bool = True,\n",
    "):\n",
    "    \"\"\"\n",
    "    Predict [EN, MS, OTHERS, CAPITAL, NOT_LANG] on word level. \n",
    "    This method assumes the string is already tokenized.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    words: List[str]\n",
    "    acceptable_ms_label: List[str], optional (default = ['malay', 'ind'])\n",
    "        accept labels from language detection model to assume a word is `MS`.\n",
    "    acceptable_en_label: List[str], optional (default = ['eng', 'manglish'])\n",
    "        accept labels from language detection model to assume a word is `EN`.\n",
    "    use_is_malay: bool, optional (default=True)\n",
    "        if `True`, will predict MS words using `malaya.dictionary.is_malay`,\n",
    "        else use language detection model.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[str]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
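   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The label set can be illustrated with a toy rule cascade. This is only a sketch under stated assumptions, not Malaya's implementation: the word sets, the function name `toy_label` and the order of the checks are all hypothetical, and the real method also consults pyenchant, `malaya.dictionary.is_malay` and the language detection model.\n",
     "\n",
     "```python\n",
     "# Toy word lists standing in for real dictionaries (assumption for illustration).\n",
     "MS_WORDS = {'saya', 'suka', 'hari', 'isnin'}\n",
     "EN_WORDS = {'chicken', 'and', 'fish'}\n",
     "\n",
     "\n",
     "def toy_label(word: str) -> str:\n",
     "    # Tokens without any letter (numbers, punctuation, emoji) are not language.\n",
     "    if not any(ch.isalpha() for ch in word):\n",
     "        return 'NOT_LANG'\n",
     "    # Fully upper-case tokens are treated as acronyms.\n",
     "    if word.isupper():\n",
     "        return 'CAPITAL'\n",
     "    lowered = word.lower()\n",
     "    if lowered in MS_WORDS:\n",
     "        return 'MS'\n",
     "    if lowered in EN_WORDS:\n",
     "        return 'EN'\n",
     "    # A real pipeline would fall back to a classifier here.\n",
     "    return 'OTHERS'\n",
     "\n",
     "\n",
     "words = 'saya suka chicken and fish , 22 KKIA'.split()\n",
     "print([(w, toy_label(w)) for w in words])\n",
     "```\n",
     "\n",
     "Checking the cheap surface rules first keeps numbers, punctuation and acronyms away from the dictionaries and the classifier, which is one plausible reason for having separate `NOT_LANG` and `CAPITAL` labels."
    ]
   },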
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('saya', 'MS'),\n",
       " ('suka', 'MS'),\n",
       " ('chicken', 'EN'),\n",
       " ('and', 'EN'),\n",
       " ('fish', 'EN'),\n",
       " ('pda', 'MS'),\n",
       " ('hari', 'MS'),\n",
       " ('isnin', 'MS')]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string = 'saya suka chicken and fish pda hari isnin'\n",
    "splitted = string.split()\n",
    "list(zip(splitted, model.predict(splitted)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('saya', 'MS'),\n",
       " ('suka', 'MS'),\n",
       " ('chicken', 'EN'),\n",
       " ('and', 'EN'),\n",
       " ('fish', 'EN'),\n",
       " ('pda', 'MS'),\n",
       " ('hari', 'MS'),\n",
       " ('isnin', 'MS'),\n",
       " (',', 'NOT_LANG'),\n",
       " ('tarikh', 'MS'),\n",
       " ('22', 'NOT_LANG'),\n",
       " ('mei', 'MS')]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string = 'saya suka chicken and fish pda hari isnin , tarikh 22 mei'\n",
    "splitted = string.split()\n",
    "list(zip(splitted, model.predict(splitted)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('saya', 'MS'),\n",
       " ('suka', 'MS'),\n",
       " ('chicken', 'EN'),\n",
       " ('🐔', 'NOT_LANG'),\n",
       " ('and', 'EN'),\n",
       " ('fish', 'EN'),\n",
       " ('pda', 'MS'),\n",
       " ('hari', 'MS'),\n",
       " ('isnin', 'MS'),\n",
       " (',', 'NOT_LANG'),\n",
       " ('tarikh', 'MS'),\n",
       " ('22', 'NOT_LANG'),\n",
       " ('mei', 'MS')]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string = 'saya suka chicken 🐔 and fish pda hari isnin , tarikh 22 mei'\n",
    "splitted = string.split()\n",
    "list(zip(splitted, model.predict(splitted)))"
   ]
  },
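   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Because `predict` returns one label per word, the result is easy to post-process. As a minimal sketch, neighbouring words that share a label can be merged into code-switching spans with `itertools.groupby`. The helper name `group_spans` is hypothetical, and the pairs are copied from the first prediction output above rather than recomputed.\n",
     "\n",
     "```python\n",
     "from itertools import groupby\n",
     "\n",
     "# (word, label) pairs copied from the first prediction output.\n",
     "pairs = [\n",
     "    ('saya', 'MS'), ('suka', 'MS'), ('chicken', 'EN'), ('and', 'EN'),\n",
     "    ('fish', 'EN'), ('pda', 'MS'), ('hari', 'MS'), ('isnin', 'MS'),\n",
     "]\n",
     "\n",
     "\n",
     "def group_spans(pairs):\n",
     "    # Merge neighbouring tokens that share a label into one span.\n",
     "    return [\n",
     "        (label, [word for word, _ in group])\n",
     "        for label, group in groupby(pairs, key=lambda p: p[1])\n",
     "    ]\n",
     "\n",
     "\n",
     "print(group_spans(pairs))\n",
     "```"
    ]
   },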
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use malaya.preprocessing.Tokenizer\n",
    "\n",
    "`malaya.preprocessing.Tokenizer` gives better word tokens than `str.split`, for example by separating punctuation from words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "string = 'Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Terminal',\n",
       " '1',\n",
       " 'KKIA',\n",
       " 'dilengkapi',\n",
       " 'kemudahan',\n",
       " '64',\n",
       " 'kaunter',\n",
       " 'daftar',\n",
       " 'masuk',\n",
       " ',',\n",
       " '12',\n",
       " 'aero',\n",
       " 'bridge',\n",
       " 'selain',\n",
       " 'mampu',\n",
       " 'menampung',\n",
       " '3,200',\n",
       " 'penumpang',\n",
       " 'dalam',\n",
       " 'satu',\n",
       " 'masa',\n",
       " '.']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer = malaya.preprocessing.Tokenizer()\n",
    "tokenized = tokenizer.tokenize(string)\n",
    "tokenized"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Terminal', 'MS'),\n",
       " ('1', 'NOT_LANG'),\n",
       " ('KKIA', 'CAPITAL'),\n",
       " ('dilengkapi', 'MS'),\n",
       " ('kemudahan', 'MS'),\n",
       " ('64', 'NOT_LANG'),\n",
       " ('kaunter', 'MS'),\n",
       " ('daftar', 'MS'),\n",
       " ('masuk', 'MS'),\n",
       " (',', 'NOT_LANG'),\n",
       " ('12', 'NOT_LANG'),\n",
       " ('aero', 'OTHERS'),\n",
       " ('bridge', 'EN'),\n",
       " ('selain', 'MS'),\n",
       " ('mampu', 'MS'),\n",
       " ('menampung', 'MS'),\n",
       " ('3,200', 'NOT_LANG'),\n",
       " ('penumpang', 'MS'),\n",
       " ('dalam', 'MS'),\n",
       " ('satu', 'MS'),\n",
       " ('masa', 'MS'),\n",
       " ('.', 'NOT_LANG')]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(zip(tokenized, model.predict(tokenized)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
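    "If the string is not properly tokenized, punctuation stays attached to the words (`masuk,` and `masa.` below) and those tokens are labelled `OTHERS`:"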
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Terminal', 'MS'),\n",
       " ('1', 'NOT_LANG'),\n",
       " ('KKIA', 'CAPITAL'),\n",
       " ('dilengkapi', 'MS'),\n",
       " ('kemudahan', 'MS'),\n",
       " ('64', 'NOT_LANG'),\n",
       " ('kaunter', 'MS'),\n",
       " ('daftar', 'MS'),\n",
       " ('masuk,', 'OTHERS'),\n",
       " ('12', 'NOT_LANG'),\n",
       " ('aero', 'OTHERS'),\n",
       " ('bridge', 'EN'),\n",
       " ('selain', 'MS'),\n",
       " ('mampu', 'MS'),\n",
       " ('menampung', 'MS'),\n",
       " ('3,200', 'NOT_LANG'),\n",
       " ('penumpang', 'MS'),\n",
       " ('dalam', 'MS'),\n",
       " ('satu', 'MS'),\n",
       " ('masa.', 'OTHERS')]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "splitted = string.split()\n",
    "list(zip(splitted, model.predict(splitted)))"
   ]
  },
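   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "When `malaya.preprocessing.Tokenizer` is not at hand, punctuation can be separated with a small regular expression before calling `predict`. The sketch below is a hypothetical stand-in (`simple_tokenize` is not a Malaya function) and covers far fewer cases than the real tokenizer; for instance, it would split `cousin's` at the apostrophe.\n",
     "\n",
     "```python\n",
     "import re\n",
     "\n",
     "# Numbers with thousands separators first, then words, then single symbols.\n",
     "TOKEN_RE = re.compile(r'\\d+(?:,\\d+)*|\\w+|[^\\w\\s]')\n",
     "\n",
     "\n",
     "def simple_tokenize(text: str):\n",
     "    # findall returns the non-overlapping matches from left to right.\n",
     "    return TOKEN_RE.findall(text)\n",
     "\n",
     "\n",
     "print(simple_tokenize('daftar masuk, 12 aero bridge 3,200 penumpang dalam satu masa.'))\n",
     "```\n",
     "\n",
     "Listing the number pattern before the word pattern keeps `3,200` as a single token instead of breaking it at the comma."
    ]
   },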
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### More examples\n",
    "\n",
    "Copied from Twitter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = \"just attended my cousin's wedding. pelik jugak dia buat majlis biasa2 je sebab her lifestyle looks lavish. then i found out they're going on a 3 weeks honeymoon. smart decision 👍\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('just', 'EN'),\n",
       " ('attended', 'EN'),\n",
       " ('my', 'EN'),\n",
       " (\"cousin's\", 'EN'),\n",
       " ('wedding', 'EN'),\n",
       " ('.', 'NOT_LANG'),\n",
       " ('pelik', 'MS'),\n",
       " ('jugak', 'MS'),\n",
       " ('dia', 'MS'),\n",
       " ('buat', 'MS'),\n",
       " ('majlis', 'MS'),\n",
       " ('biasa2', 'OTHERS'),\n",
       " ('je', 'MS'),\n",
       " ('sebab', 'MS'),\n",
       " ('her', 'EN'),\n",
       " ('lifestyle', 'EN'),\n",
       " ('looks', 'EN'),\n",
       " ('lavish', 'EN'),\n",
       " ('.', 'NOT_LANG'),\n",
       " ('then', 'EN'),\n",
       " ('i', 'MS'),\n",
       " ('found', 'EN'),\n",
       " ('out', 'EN'),\n",
       " (\"they'\", 'OTHERS'),\n",
       " ('re', 'EN'),\n",
       " ('going', 'EN'),\n",
       " ('on', 'EN'),\n",
       " ('a', 'EN'),\n",
       " ('3', 'NOT_LANG'),\n",
       " ('weeks', 'EN'),\n",
       " ('honeymoon', 'EN'),\n",
       " ('.', 'NOT_LANG'),\n",
       " ('smart', 'EN'),\n",
       " ('decision', 'EN'),\n",
       " ('👍', 'NOT_LANG')]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized = tokenizer.tokenize(s)\n",
    "list(zip(tokenized, model.predict(tokenized)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Hello', 'EN'),\n",
       " ('gais', 'MS'),\n",
       " (',', 'NOT_LANG'),\n",
       " ('boleh', 'MS'),\n",
       " ('tolong', 'MS'),\n",
       " ('recommend', 'EN'),\n",
       " ('bengkel', 'MS'),\n",
       " ('ketuk', 'MS'),\n",
       " ('yang', 'MS'),\n",
       " ('okay', 'EN'),\n",
       " ('near', 'EN'),\n",
       " ('Wangsa', 'MS'),\n",
       " ('Maju', 'MS'),\n",
       " ('/', 'NOT_LANG'),\n",
       " ('nearby', 'EN'),\n",
       " ('?', 'NOT_LANG'),\n",
       " ('Kereta', 'MS'),\n",
       " ('bf', 'MS'),\n",
       " ('i', 'MS'),\n",
       " ('pulak', 'MS'),\n",
       " ('kepek', 'MS'),\n",
       " ('langgar', 'MS'),\n",
       " ('dinding', 'MS'),\n",
       " ('hahahha', 'NOT_LANG')]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = 'Hello gais, boleh tolong recommend bengkel ketuk yang okay near Wangsa Maju / nearby? Kereta bf i pulak kepek langgar dinding hahahha'\n",
    "tokenized = tokenizer.tokenize(s)\n",
    "list(zip(tokenized, model.predict(tokenized)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Me', 'EN'),\n",
       " ('after', 'EN'),\n",
       " ('seeing', 'EN'),\n",
       " ('this', 'EN'),\n",
       " ('video', 'MS'),\n",
       " (':', 'NOT_LANG'),\n",
       " ('mm', 'EN'),\n",
       " ('dapnya', 'MS'),\n",
       " ('burger', 'MS'),\n",
       " ('benjo', 'OTHERS'),\n",
       " ('extra', 'EN'),\n",
       " ('mayo', 'EN')]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = 'Me after seeing this video: mm dapnya burger benjo extra mayo'\n",
    "tokenized = tokenizer.tokenize(s)\n",
    "list(zip(tokenized, model.predict(tokenized)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Hi', 'EN'),\n",
       " ('guys', 'EN'),\n",
       " ('!', 'NOT_LANG'),\n",
       " ('I', 'CAPITAL'),\n",
       " ('noticed', 'EN'),\n",
       " ('semalam', 'MS'),\n",
       " ('&', 'NOT_LANG'),\n",
       " ('harini', 'MS'),\n",
       " ('dah', 'MS'),\n",
       " ('ramai', 'MS'),\n",
       " ('yang', 'MS'),\n",
       " ('dapat', 'MS'),\n",
       " ('cookies', 'EN'),\n",
       " ('ni', 'MS'),\n",
       " ('kan', 'MS'),\n",
       " ('.', 'NOT_LANG'),\n",
       " ('So', 'MS'),\n",
       " ('harini', 'MS'),\n",
       " ('i', 'MS'),\n",
       " ('nak', 'MS'),\n",
       " ('share', 'EN'),\n",
       " ('some', 'EN'),\n",
       " ('post', 'MS'),\n",
       " ('mortem', 'MS'),\n",
       " ('of', 'EN'),\n",
       " ('our', 'EN'),\n",
       " ('first', 'EN'),\n",
       " ('batch', 'EN'),\n",
       " (':', 'NOT_LANG')]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:'\n",
    "tokenized = tokenizer.tokenize(s)\n",
    "list(zip(tokenized, model.predict(tokenized)))"
   ]
   }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
